Port of Ernie4 5 #348
base: main
Conversation
Hi! Thanks for creating the updated model copy — really appreciate it! Could I ask how you managed to add the missing tokenizer.json file to smdesai/ERNIE-4.5-0.3B-PT-bf16? Would love to learn from your process — thanks in advance!
@@ -176,7 +176,7 @@ class LLMEvaluator {

     /// This controls which model loads. `qwen2_5_1_5b` is one of the smaller ones, so this will fit on
     /// more devices.
-    let modelConfiguration = LLMRegistry.qwen3_1_7b_4bit
+    let modelConfiguration = LLMRegistry.ernie4503BPTbf16
It's recommended not to modify this line. See #302 (comment).
@johnmai-dev Thanks for the heads up on the recommendation; I've reverted it. As for generating tokenizer.json, this is the Python script I used. I had downloaded the model beforehand, so I used the last example.
from transformers import AutoTokenizer
import json
import os


def convert_tokenizer_model_to_json(model_path, output_path=None):
    """
    Convert a tokenizer.model file to tokenizer.json format.

    Args:
        model_path: Path to the tokenizer.model file or directory containing it
        output_path: Optional output path for tokenizer.json (defaults to same directory)
    """
    # Handle both file and directory paths
    if os.path.isdir(model_path):
        tokenizer_model_path = os.path.join(model_path, "tokenizer.model")
    else:
        tokenizer_model_path = model_path
        model_path = os.path.dirname(model_path)

    if not os.path.exists(tokenizer_model_path):
        raise FileNotFoundError(f"tokenizer.model not found at {tokenizer_model_path}")

    # Load the slow tokenizer and re-save it; save_pretrained writes tokenizer.json
    # when a fast-tokenizer converter exists for the declared tokenizer class.
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    if output_path is None:
        output_path = model_path
    tokenizer.save_pretrained(output_path)

    tokenizer_json_path = os.path.join(output_path, "tokenizer.json")
    if os.path.exists(tokenizer_json_path):
        print(f"Successfully created tokenizer.json at {tokenizer_json_path}")
    else:
        print("Warning: tokenizer.json was not created. The tokenizer might not support this format.")
    return tokenizer_json_path


# Example usage
if __name__ == "__main__":
    # Example: Convert a tokenizer.model file
    # convert_tokenizer_model_to_json("/path/to/tokenizer.model")

    # Example: Convert from a directory containing tokenizer.model
    # convert_tokenizer_model_to_json("/path/to/model/directory")
    pass  # keeps the block valid while both examples stay commented out
Thank you very much! @smdesai
Hey @johnmai-dev, I see the ERNIE model in mlx-community was added by you. Any chance you could add tokenizer.json to the models there? I can then change LLMModelFactory to reference the model in mlx-community.
Where can I download ERNIE-4.5-0.3B-PT-bf16 from this notebook?
The files are identical to the ones here: https://huggingface.co/mlx-community/ERNIE-4.5-0.3B-PT-bf16
Unfortunately, it still cannot be generated. I have added `huggingface-cli download` and `pip install` commands to your notebook. Can you try running it with the notebook I provided? Let me see if there is any difference in the results when you run it with my notebook: https://colab.research.google.com/drive/1fAHK6EL8JYsHDo5duJr5llI1YeeK974t?usp=sharing


OK, I have no idea what's going on here. I tried your notebook and got the same error as you. I also tried the same changes in my notebook and got the same error (not surprising). So the only things that work are the following (see the sketch after this list):
- downloading the model files from Hugging Face to a local directory and converting locally
- uploading the model files to Colab and performing the conversion there
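For reference, a minimal sketch of that "download locally, then convert" route, assuming `huggingface_hub` and `transformers` are installed. The repo id is the Baidu upstream linked later in this thread; whether tokenizer.json is actually emitted depends on the tokenizer_class declared in its tokenizer_config.json, as discussed below.

```python
# Sketch only: download the model files to a local directory, then re-save the
# tokenizer so tokenizer.json is written alongside them (if a fast converter exists).
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

local_dir = snapshot_download("baidu/ERNIE-4.5-0.3B-PT")  # local path with tokenizer.model etc.
# Custom tokenizer classes may additionally require trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained(local_dir)
tokenizer.save_pretrained(local_dir)  # writes tokenizer.json when supported
```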
The tokenizer_config.json you are using is inconsistent with Baidu's. Since yours declares LlamaTokenizer, your model can generate tokenizer.json:
https://huggingface.co/smdesai/ERNIE-4.5-0.3B-PT-bf16/blob/main/tokenizer_config.json#L9242
https://huggingface.co/baidu/ERNIE-4.5-0.3B-PT/blob/main/tokenizer_config.json#L14
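A quick, purely illustrative way to see this mismatch after downloading both repos locally is to print the declared `tokenizer_class` from each config (directory names below are placeholders, not real paths):

```python
# Illustrative only: compare the tokenizer_class declared by each tokenizer_config.json.
import json

for repo_dir in ("smdesai-ERNIE-4.5-0.3B-PT-bf16", "baidu-ERNIE-4.5-0.3B-PT"):
    with open(f"{repo_dir}/tokenizer_config.json") as f:
        print(repo_dir, "->", json.load(f).get("tokenizer_class"))
```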
Thanks for tracking it down. It seems that when I was initially trying to convert, I used LlamaTokenizer, which reported an error but may have incorrectly modified tokenizer_config.json. I then switched to AutoTokenizer, which performed the conversion (incorrectly). I'm going to look at using tokenization_ernie4_5.py to generate tokenizer.json.
@johnmai-dev Try this. It keeps the rest of the model files intact, creating only tokenizer.json. Running in Colab, use this for main(). I also made changes to Tokenizer.swift in MLXCommon to support the T5Tokenizer (Unigram) for ERNIE.

def main():
    convert_sentencepiece_to_tokenizer_json(
        "models/ERNIE-4.5-0.3B-PT-bf16/tokenizer.model",
        "models/ERNIE-4.5-0.3B-PT-bf16/tokenizer.json",
    )
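The `convert_sentencepiece_to_tokenizer_json` helper itself isn't shown in this thread; a minimal sketch of what such a function could look like, assuming the `tokenizers` library's SentencePiece/Unigram support, is:

```python
# Hypothetical sketch, not the author's actual helper: build a Unigram tokenizer
# directly from the SentencePiece model and write only tokenizer.json.
from tokenizers.implementations import SentencePieceUnigramTokenizer

def convert_sentencepiece_to_tokenizer_json(spm_path: str, output_path: str) -> None:
    tokenizer = SentencePieceUnigramTokenizer.from_spm(spm_path)  # parse tokenizer.model
    tokenizer.save(output_path)  # emit tokenizer.json; other model files stay untouched
```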
This is an MLX Swift port of @johnmai-dev's port of ERNIE. I'm unable to run the model using mlx-community/ERNIE-4.5-0.3B-PT-bf16 because it's missing tokenizer.json, so I've created a copy of the model at smdesai/ERNIE-4.5-0.3B-PT-bf16 that contains the missing tokenizer.json file.
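As a quick sanity check (not part of the PR), the copied repo can be verified to expose a fast tokenizer once tokenizer.json is present; a sketch assuming `transformers` is installed and the Hub is reachable:

```python
# Sanity-check sketch: a fast tokenizer loading from the copied repo implies
# tokenizer.json is present and parseable.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("smdesai/ERNIE-4.5-0.3B-PT-bf16")
print(type(tok).__name__, "is_fast =", tok.is_fast)
print(tok("Hello from ERNIE 4.5").input_ids)
```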